── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom 1.0.7 ✔ rsample 1.2.1
✔ dials 1.4.0 ✔ tune 1.3.0
✔ infer 1.0.7 ✔ workflows 1.2.0
✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
✔ parsnip 1.3.0 ✔ yardstick 1.3.2
✔ recipes 1.1.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(powerjoin)
library(glue)
library(vip)
Attaching package: 'vip'
The following object is masked from 'package:utils':
vi
root <- 'https://gdex.ucar.edu/dataset/camels/file'
dir.create('data', showWarnings = FALSE)  # download.file() errors if the destination folder is missing
download.file('https://gdex.ucar.edu/dataset/camels/file/camels_attributes_v2.0.pdf',
              'data/camels_attributes_v2.0.pdf')
types <- c("clim", "geol", "soil", "topo", "vege", "hydro")
remote_files <- glue('{root}/camels_{types}.txt')
local_files <- glue('data/camels_{types}.txt')
walk2(remote_files, local_files, download.file, quiet = TRUE)
camels <- map(local_files, read_delim, show_col_types = FALSE)
camels <- power_full_join(camels, by = 'gauge_id')
Question 1
The zero_q_freq variable reports the frequency of days with zero streamflow (Q = 0 mm/day), expressed as a percentage. Here Q is discharge: a measure of the volume of water flowing past a point on a stream.
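To make the definition concrete, here is a tiny base-R illustration with made-up daily discharge values (not CAMELS data) showing how such a percentage is computed:

```r
# Hypothetical daily discharge record (mm/day) for one gauge
q_daily <- c(0, 0.4, 1.2, 0, 0.9, 0, 2.3, 0.1)

# Fraction of days with zero flow, expressed as a percentage
zero_q_freq <- mean(q_daily == 0) * 100
zero_q_freq  # 37.5
```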
Exploratory Data Analysis
ggplot(data = camels, aes(x = gauge_lon, y = gauge_lat)) +
  borders("state", colour = "gray50") +
  geom_point(aes(color = q_mean)) +
  scale_color_gradient(low = "pink", high = "dodgerblue") +
  ggthemes::theme_map()
Question 2
camels_long <- camels %>%
  pivot_longer(cols = c(aridity, p_mean),
               names_to = "variable",
               values_to = "value")
ggplot(data = camels_long, aes(x = gauge_lon, y = gauge_lat)) +
  borders("state", colour = "gray50") +
  geom_point(aes(color = value)) +
  scale_color_gradient(low = "pink", high = "dodgerblue") +
  ggthemes::theme_map() +
  facet_wrap(~ variable, scales = "free") +
  labs(
    title = "Maps of Aridity and P-Mean",
    x = "Longitude",
    y = "Latitude",
    color = "Measurement Value"
  )
Model Preparation & Building
camels |>
  select(aridity, p_mean, q_mean) |>
  drop_na() |>
  cor()
aridity p_mean q_mean
aridity 1.0000000 -0.7550090 -0.5817771
p_mean -0.7550090 1.0000000 0.8865757
q_mean -0.5817771 0.8865757 1.0000000
ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm", color = "red", linetype = 2) +
  scale_color_viridis_c() +
  theme_linedraw() +
  theme(legend.position = "bottom") +
  labs(title = "Aridity vs Rainfall vs Runoff",
       x = "Aridity",
       y = "Rainfall",
       color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'
ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm") +
  scale_color_viridis_c() +
  scale_x_log10() +
  scale_y_log10() +
  theme_linedraw() +
  theme(legend.position = "bottom") +
  labs(title = "Aridity vs Rainfall vs Runoff",
       x = "Aridity",
       y = "Rainfall",
       color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'
ggplot(camels, aes(x = aridity, y = p_mean)) +
  geom_point(aes(color = q_mean)) +
  geom_smooth(method = "lm") +
  scale_color_viridis_c(trans = "log") +
  scale_x_log10() +
  scale_y_log10() +
  theme_linedraw() +
  theme(legend.position = "bottom",
        legend.key.width = unit(2.5, "cm"),
        legend.key.height = unit(0.5, "cm")) +
  labs(title = "Aridity vs Rainfall vs Runoff",
       x = "Aridity",
       y = "Rainfall",
       color = "Mean Flow")
`geom_smooth()` using formula = 'y ~ x'
set.seed(123)
camels <- camels |>
  mutate(logQmean = log(q_mean))
camels_split <- initial_split(camels, prop = 0.8)
camels_train <- training(camels_split)
camels_test <- testing(camels_split)
camels_cv <- vfold_cv(camels_train, v = 10)
rec <- recipe(logQmean ~ aridity + p_mean, data = camels_train) %>%
  step_log(all_predictors()) %>%
  step_interact(terms = ~ aridity:p_mean) %>%
  step_naomit(all_predictors(), all_outcomes())
baked_data <- prep(rec, camels_train) |>
  bake(new_data = NULL)
lm_base <- lm(logQmean ~ aridity * p_mean, data = baked_data)
summary(lm_base)
Call:
lm(formula = logQmean ~ aridity * p_mean, data = baked_data)
Residuals:
Min 1Q Median 3Q Max
-2.91162 -0.21601 -0.00716 0.21230 2.85706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.77586 0.16365 -10.852 < 2e-16 ***
aridity -0.88397 0.16145 -5.475 6.75e-08 ***
p_mean 1.48438 0.15511 9.570 < 2e-16 ***
aridity:p_mean 0.10484 0.07198 1.457 0.146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7684
F-statistic: 591.6 on 3 and 531 DF, p-value: < 2.2e-16
summary(lm(logQmean ~ aridity + p_mean + aridity_x_p_mean, data = baked_data))
Call:
lm(formula = logQmean ~ aridity + p_mean + aridity_x_p_mean,
data = baked_data)
Residuals:
Min 1Q Median 3Q Max
-2.91162 -0.21601 -0.00716 0.21230 2.85706
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.77586 0.16365 -10.852 < 2e-16 ***
aridity -0.88397 0.16145 -5.475 6.75e-08 ***
p_mean 1.48438 0.15511 9.570 < 2e-16 ***
aridity_x_p_mean 0.10484 0.07198 1.457 0.146
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5696 on 531 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7684
F-statistic: 591.6 on 3 and 531 DF, p-value: < 2.2e-16
test_data <- bake(prep(rec), new_data = camels_test)
test_data$lm_pred <- predict(lm_base, newdata = test_data)
metrics(test_data, truth = logQmean, estimate = lm_pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.583
2 rsq standard 0.742
3 mae standard 0.390
ggplot(test_data, aes(x = logQmean, y = lm_pred, colour = aridity)) +
  scale_color_gradient2(low = "brown", mid = "orange", high = "darkgreen") +
  geom_point() +
  geom_abline(linetype = 2) +
  theme_linedraw() +
  labs(title = "Linear Model: Observed vs Predicted",
       x = "Observed Log Mean Flow",
       y = "Predicted Log Mean Flow",
       color = "Aridity")
lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
lm_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_model) %>%
  fit(data = camels_train)
summary(extract_fit_engine(lm_wf))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.7758557 0.16364755 -10.851710 6.463654e-25
aridity -0.8839738 0.16144589 -5.475357 6.745512e-08
p_mean 1.4843771 0.15511117 9.569762 4.022500e-20
aridity_x_p_mean 0.1048449 0.07198145 1.456555 1.458304e-01
summary(lm_base)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.7758557 0.16364755 -10.851710 6.463654e-25
aridity -0.8839738 0.16144589 -5.475357 6.745512e-08
p_mean 1.4843771 0.15511117 9.569762 4.022500e-20
aridity:p_mean 0.1048449 0.07198145 1.456555 1.458304e-01
lm_data <- augment(lm_wf, new_data = camels_test)
dim(lm_data)
metrics(lm_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.583
2 rsq standard 0.742
3 mae standard 0.390
ggplot(lm_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()
library(baguette)
rf_model <- rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")
rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_model) %>%
  fit(data = camels_train)
rf_data <- augment(rf_wf, new_data = camels_test)
dim(rf_data)
metrics(rf_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.587
2 rsq standard 0.741
3 mae standard 0.363
ggplot(rf_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()
wf <- workflow_set(list(rec), list(lm_model, rf_model)) %>%
  workflow_map('fit_resamples', resamples = camels_cv)
autoplot(wf)
rank_results(wf, rank_metric = "rsq", select_best = TRUE)
# A tibble: 4 × 9
wflow_id .config .metric mean std_err n preprocessor model rank
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
1 recipe_rand_fore… Prepro… rmse 0.563 0.0247 10 recipe rand… 1
2 recipe_rand_fore… Prepro… rsq 0.771 0.0259 10 recipe rand… 1
3 recipe_linear_reg Prepro… rmse 0.569 0.0260 10 recipe line… 2
4 recipe_linear_reg Prepro… rsq 0.770 0.0223 10 recipe line… 2
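Since the vip package was loaded earlier and the ranger engine was fit with importance = "impurity", one optional follow-up (a sketch, not run here) is to ask which predictors drive the random forest:

```r
# Sketch: variable importance from the fitted random-forest workflow.
# This works only because set_engine("ranger", importance = "impurity")
# was specified when rf_model was defined above.
rf_wf %>%
  extract_fit_engine() %>%
  vip::vip(num_features = 3)  # the recipe yields aridity, p_mean, aridity_x_p_mean
```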
Question 3
# Boost Tree
bt_model <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")
bt_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(bt_model) %>%
  fit(data = camels_train)
bt_data <- augment(bt_wf, new_data = camels_test)
dim(bt_data)
metrics(bt_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.631
2 rsq standard 0.702
3 mae standard 0.397
ggplot(bt_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()
# Neural Network
nnet_model <- bag_mlp() %>%
  set_engine("nnet") %>%
  set_mode("regression")
nnet_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(nnet_model) %>%
  fit(data = camels_train)
nnet_data <- augment(nnet_wf, new_data = camels_test)
dim(nnet_data)
metrics(nnet_data, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.545
2 rsq standard 0.772
3 mae standard 0.337
ggplot(nnet_data, aes(x = logQmean, y = .pred, colour = aridity)) +
  scale_color_viridis_c() +
  geom_point() +
  geom_abline() +
  theme_linedraw()
# Add to existing workflow
wf <- workflow_set(list(rec), list(lm_model, rf_model, nnet_model, bt_model)) %>%
  workflow_map('fit_resamples', resamples = camels_cv)
autoplot(wf)
rank_results(wf, rank_metric = "rsq", select_best = TRUE)
# A tibble: 8 × 9
wflow_id .config .metric mean std_err n preprocessor model rank
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
1 recipe_bag_mlp Prepro… rmse 0.547 0.0282 10 recipe bag_… 1
2 recipe_bag_mlp Prepro… rsq 0.786 0.0234 10 recipe bag_… 1
3 recipe_rand_fore… Prepro… rmse 0.564 0.0260 10 recipe rand… 2
4 recipe_rand_fore… Prepro… rsq 0.770 0.0266 10 recipe rand… 2
5 recipe_linear_reg Prepro… rmse 0.569 0.0260 10 recipe line… 3
6 recipe_linear_reg Prepro… rsq 0.770 0.0223 10 recipe line… 3
7 recipe_boost_tree Prepro… rmse 0.600 0.0289 10 recipe boos… 4
8 recipe_boost_tree Prepro… rsq 0.745 0.0268 10 recipe boos… 4
Of the four models, the bagged neural network performs best and the XGBoost model performs worst. The linear model still ranks above XGBoost, possibly because the boosted-tree method could not extract much additional signal from this relatively small dataset. However, the linear model was itself outperformed by the random forest and the neural network, suggesting that a simple linear relationship is not sufficient to explain the interactions between terms. The neural network's flexibility and its ability to learn complex interactions in the data make it the best model here, as the table and plot above demonstrate. For these reasons I would continue with the neural network model.
Build Your Own
# Data Splitting
set.seed(124)
camels <- camels %>%
  mutate(logQmean = log(q_mean)) %>%
  select(logQmean, p_mean, aridity, soil_depth_pelletier, max_water_content,
         organic_frac, frac_snow, pet_mean, soil_depth_statsgo, elev_mean) %>%
  drop_na()
c_split <- initial_split(camels, prop = 0.8)
c_train <- training(c_split)
c_test <- testing(c_split)
c_cv <- vfold_cv(c_train, v = 10)
# Recipe
rec2 <- recipe(logQmean ~ ., data = c_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())
I chose this formula because it regresses the log-transformed mean flow on the full set of predictors. Log-transforming q_mean produced a more predictable relationship with p_mean and aridity earlier, so I kept that transformation here. Because the predictors span very different units and magnitudes, I used step_scale() and step_center() to standardize them to a common scale, which supports more stable model fitting.
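As a quick base-R sanity check (on a made-up predictor column, not the CAMELS data), scaling by the standard deviation and then centering yields a standardized variable with mean 0 and standard deviation 1, which is what these two recipe steps accomplish together:

```r
x <- c(2, 4, 6, 8, 10)             # hypothetical predictor column
scaled <- x / sd(x)                # what step_scale() does
centered <- scaled - mean(scaled)  # what step_center() then does
round(mean(centered), 10)          # 0
round(sd(centered), 10)            # 1
```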
c_baked <- prep(rec2, c_train) |>
  bake(new_data = NULL)
lm_base2 <- lm(logQmean ~ ., data = c_baked)
summary(lm_base2)
Call:
lm(formula = logQmean ~ ., data = c_baked)
Residuals:
Min 1Q Median 3Q Max
-2.7726 -0.1543 0.0502 0.2400 3.9490
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.081749 0.022536 -3.627 0.000314 ***
p_mean 0.532187 0.039029 13.636 < 2e-16 ***
aridity -0.643846 0.049819 -12.924 < 2e-16 ***
soil_depth_pelletier -0.077194 0.029114 -2.651 0.008256 **
max_water_content 0.058350 0.043250 1.349 0.177877
organic_frac 0.005265 0.025468 0.207 0.836304
frac_snow 0.287464 0.051639 5.567 4.14e-08 ***
pet_mean 0.137612 0.028827 4.774 2.35e-06 ***
soil_depth_statsgo -0.156120 0.039232 -3.979 7.88e-05 ***
elev_mean 0.034353 0.057964 0.593 0.553668
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5218 on 526 degrees of freedom
Multiple R-squared: 0.8012, Adjusted R-squared: 0.7978
F-statistic: 235.5 on 9 and 526 DF, p-value: < 2.2e-16
# Linear Model
lm_model2 <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
lm_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(lm_model2) %>%
  fit(data = c_train)
summary(extract_fit_engine(lm_wf2))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.081748780 0.02253641 -3.6274094 3.141820e-04
p_mean 0.532187271 0.03902852 13.6358550 1.809987e-36
aridity -0.643845847 0.04981868 -12.9237847 2.239196e-33
soil_depth_pelletier -0.077194364 0.02911388 -2.6514630 8.256109e-03
max_water_content 0.058350264 0.04325039 1.3491270 1.778766e-01
organic_frac 0.005264932 0.02546817 0.2067260 8.363039e-01
frac_snow 0.287464074 0.05163924 5.5667757 4.141554e-08
pet_mean 0.137611570 0.02882695 4.7737118 2.347449e-06
soil_depth_statsgo -0.156119578 0.03923198 -3.9793961 7.878971e-05
elev_mean 0.034352637 0.05796409 0.5926537 5.536676e-01
lm_data2 <- augment(lm_wf2, new_data = c_test)
metrics(lm_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.447
2 rsq standard 0.874
3 mae standard 0.302
# Random Forest Model
rf_model2 <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")
rf_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(rf_model2) %>%
  fit(data = c_train)
rf_data2 <- augment(rf_wf2, new_data = c_test)
metrics(rf_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.290
2 rsq standard 0.946
3 mae standard 0.191
# Boost Tree
bt_model2 <- boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")
bt_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(bt_model2) %>%
  fit(data = c_train)
bt_data2 <- augment(bt_wf2, new_data = c_test)
metrics(bt_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.384
2 rsq standard 0.903
3 mae standard 0.241
# Neural Network
nnet_model2 <- bag_mlp() %>%
  set_engine("nnet") %>%
  set_mode("regression")
nnet_wf2 <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(nnet_model2) %>%
  fit(data = c_train)
nnet_data2 <- augment(nnet_wf2, new_data = c_test)
metrics(nnet_data2, truth = logQmean, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 0.303
2 rsq standard 0.943
3 mae standard 0.203
# Workflow Set
wf2 <- workflow_set(list(rec2), list(lm_model2, rf_model2, nnet_model2, bt_model2)) %>%
  workflow_map('fit_resamples', resamples = c_cv)
# Evaluation
autoplot(wf2)
rank_results(wf2, rank_metric = "rsq", select_best = TRUE)
# A tibble: 8 × 9
wflow_id .config .metric mean std_err n preprocessor model rank
<chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
1 recipe_rand_fore… Prepro… rmse 0.389 0.0273 10 recipe rand… 1
2 recipe_rand_fore… Prepro… rsq 0.896 0.0107 10 recipe rand… 1
3 recipe_bag_mlp Prepro… rmse 0.399 0.0331 10 recipe bag_… 2
4 recipe_bag_mlp Prepro… rsq 0.888 0.0115 10 recipe bag_… 2
5 recipe_boost_tree Prepro… rmse 0.409 0.0312 10 recipe boos… 3
6 recipe_boost_tree Prepro… rsq 0.882 0.0106 10 recipe boos… 3
7 recipe_linear_reg Prepro… rmse 0.531 0.0335 10 recipe line… 4
8 recipe_linear_reg Prepro… rsq 0.800 0.0136 10 recipe line… 4
By cross-validated R-squared, the random forest narrowly ranks first (mean rsq 0.896), with the bagged neural network a close second (0.888); on the held-out test set the two are likewise nearly tied (rsq 0.946 vs 0.943), each accounting for over 94% of the variation in logQmean. As in the previous example, the relationships between the variables are non-linear, so the linear model falls short, and the decision-tree-based models and neural network perform substantially better. Given how close the top two models are, and the neural network's flexibility in capturing complex interactions among predictors, I carried the neural network forward for the final evaluation.
# Extract and Evaluate
# workflow created in initial model creation above
ggplot(nnet_data2, aes(x = logQmean, y = .pred)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline() +
  labs(
    title = "Observed vs. Predicted Log Mean Streamflow",
    x = "Observed logQmean",
    y = "Predicted logQmean"
  ) +
  theme_minimal()
The model appears to fit logQmean well across the chosen predictors. The 1:1 line captures the overall trend and slope of the data, although there is some scatter between observed and predicted values. The model could likely be fine-tuned, for example by adding further predictor variables to increase its predictive capacity. However, no model will predict streamflow on a given day with perfect accuracy, given daily variability and the presence of outliers, so this model is an excellent starting point for making predictions.
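Another fine-tuning route (a hypothetical sketch, not run here) would be to tune the bagged MLP's hyperparameters over the existing CV folds instead of accepting the defaults; hidden_units and penalty are the tunable arguments assumed below:

```r
# Sketch: grid-tune the bagged MLP over the CV folds defined earlier (c_cv).
nnet_tune <- bag_mlp(hidden_units = tune(), penalty = tune()) %>%
  set_engine("nnet") %>%
  set_mode("regression")

tune_res <- workflow() %>%
  add_recipe(rec2) %>%
  add_model(nnet_tune) %>%
  tune_grid(resamples = c_cv, grid = 10)  # 10 candidate parameter sets

show_best(tune_res, metric = "rmse")  # best candidates by CV RMSE
```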